.. _`K-means Clustering`:

.. _`org.sysess.sympathy.machinelearning.k_means`:

K-means Clustering
``````````````````

.. image:: dataset_blobs.svg
   :width: 48

Clusters data by trying to separate samples in n groups of equal variance.

Documentation
:::::::::::::

Attributes
==========

**cluster_centers_**
    Coordinates of cluster centers. If the algorithm stops before fully
    converging (see ``tol`` and ``max_iter``), these will not be consistent
    with ``labels_``.

**inertia_**
    Sum of squared distances of samples to their closest cluster center,
    weighted by the sample weights if provided.

**labels_**
    Labels of each point.

Definition
::::::::::

Output ports
============

**model** model
    Model

Configuration
=============

**K-means algorithm** (algorithm)
    K-means algorithm to use. The classical EM-style algorithm is
    `"lloyd"`. The `"elkan"` variation can be more efficient on some
    datasets with well-defined clusters, by using the triangle inequality.
    However, it is more memory intensive due to the allocation of an extra
    array of shape `(n_samples, n_clusters)`.

    .. versionchanged:: 0.18
        Added Elkan algorithm

    .. versionchanged:: 1.1
        Renamed "full" to "lloyd", and deprecated "auto" and "full".
        Changed "auto" to use "lloyd" instead of "elkan".

**Initialization method** (init)
    Method for initialization:

    * 'k-means++': selects initial cluster centroids using sampling based on
      an empirical probability distribution of the points' contribution to
      the overall inertia. This technique speeds up convergence. The
      algorithm implemented is "greedy k-means++". It differs from the
      vanilla k-means++ by making several trials at each sampling step and
      choosing the best centroid among them.

    * 'random': choose `n_clusters` observations (rows) at random from the
      data for the initial centroids.

    * If an array is passed, it should be of shape
      ``(n_clusters, n_features)`` and gives the initial centers.

    * If a callable is passed, it should take arguments ``X``, `n_clusters`
      and a random state, and return an initialization.
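    As a minimal sketch of these options (this node wraps scikit-learn, so
    the snippet below calls ``sklearn.cluster.KMeans`` directly; the toy
    data and seeds are illustrative assumptions):

    .. code-block:: python

        import numpy as np
        from sklearn.cluster import KMeans

        # Toy data: two well-separated blobs (illustrative assumption).
        rng = np.random.RandomState(0)
        X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
                       rng.normal(5.0, 0.5, size=(50, 2))])

        # Built-in strategies: 'k-means++' (greedy k-means++) and 'random'.
        km_pp = KMeans(n_clusters=2, init="k-means++", n_init=10,
                       random_state=0).fit(X)
        km_rand = KMeans(n_clusters=2, init="random", n_init=10,
                         random_state=0).fit(X)

        # An explicit array of shape (n_clusters, n_features) is also valid.
        centers = np.array([[0.0, 0.0], [5.0, 5.0]])
        km_arr = KMeans(n_clusters=2, init=centers, n_init=1).fit(X)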
    For an example of how to use the different `init` strategies, see the
    example entitled sphx_glr_auto_examples_cluster_plot_kmeans_digits.py.

**Maximum number of iterations** (max_iter)
    Maximum number of iterations of the k-means algorithm for a single run.

**Number of clusters/centroids** (n_clusters)
    The number of clusters to form as well as the number of centroids to
    generate. For an example of how to choose an optimal value for
    `n_clusters`, refer to
    sphx_glr_auto_examples_cluster_plot_kmeans_silhouette_analysis.py.

**Number of runs** (n_init)
    Number of times the k-means algorithm is run with different centroid
    seeds. The final result is the best output of `n_init` consecutive runs
    in terms of inertia. Several runs are recommended for sparse
    high-dimensional problems (see kmeans_sparse_high_dim). When
    `n_init='auto'`, the number of runs depends on the value of `init`:
    10 if using `init='random'` or `init` is a callable; 1 if using
    `init='k-means++'` or `init` is an array-like.

    .. versionadded:: 1.2
        Added 'auto' option for `n_init`.

    .. versionchanged:: 1.4
        Default value for `n_init` changed to `'auto'`.

**Random seed** (random_state)
    Determines random number generation for centroid initialization. Use an
    int to make the randomness deterministic. See random_state.

**Tolerance** (tol)
    Relative tolerance with regard to the Frobenius norm of the difference
    in the cluster centers of two consecutive iterations to declare
    convergence.

Implementation
==============

.. automodule:: node_clustering
    :noindex:

.. class:: KMeansClustering
    :noindex:
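Since the configuration options above map directly onto the keyword
arguments of scikit-learn's ``KMeans`` estimator, an end-to-end sketch may
help tie them to the fitted attributes described earlier. The data, seed,
and parameter values below are illustrative assumptions:

.. code-block:: python

    import numpy as np
    from sklearn.cluster import KMeans

    # Three well-separated blobs along the diagonal (illustrative data).
    rng = np.random.RandomState(42)
    X = np.vstack([rng.normal(loc, 0.3, size=(40, 2))
                   for loc in (0.0, 4.0, 8.0)])

    model = KMeans(
        n_clusters=3,        # Number of clusters/centroids
        init="k-means++",    # Initialization method
        n_init=10,           # Number of runs
        max_iter=300,        # Maximum number of iterations
        tol=1e-4,            # Tolerance
        algorithm="lloyd",   # K-means algorithm
        random_state=0,      # Random seed
    ).fit(X)

    # Fitted attributes described under "Attributes" above:
    print(model.cluster_centers_.shape)  # (3, 2)
    print(model.labels_.shape)           # one label per sample: (120,)
    print(model.inertia_)                # sum of squared distances to centers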